Bayesian Machine Learning


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Bayesian Classifier (or Bayesian Decision Theory)

Given the height $x$ of a person, decide whether the person is male ($y=1$) or female ($y=0$).

  • Binary Classes: $y\in \{0,1\}$
$$ \begin{align*}P(y=1 \mid x) &=\frac{P(x \mid y=1)P(y=1)}{P(x)} =\frac{ \underbrace{P(x \mid y=1)}_{\text{likelihood}} \underbrace{P(y=1)}_{\text{prior}}}{\underbrace{P(x)}_{\text{marginal}}} \\ P(y=0 \mid x) &=\frac{P(x \mid y=0)P(y=0)}{P(x)} \end{align*}$$
  • Decision
$$ \begin{align*} \text{If} \; P(y=1 \mid x) > P(y=0 \mid x),\; \text{then} \; \hat{y} = 1 \\ \text{If} \; P(y=1 \mid x) < P(y=0 \mid x),\; \text{then} \; \hat{y} = 0 \end{align*} $$


$$\therefore \; \frac{P(x \mid y=0)P(y=0)}{P(x \mid y=1)P(y=1)} \quad\begin{cases} >1 \quad \implies \; \hat{y}=0 \\ =1 \quad \implies \; \text{decision boundary}\\ <1 \quad \implies \; \hat{y}=1 \end{cases}$$
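The likelihood-ratio rule above can be sketched in a few lines. The Gaussian class-conditional parameters below are the illustrative values used throughout this section, not values fitted to data:

```python
import numpy as np
from scipy.stats import norm

# illustrative class-conditional models (parameters assumed, not fitted)
mu0, sigma0 = 160, 6.5    # P(x | y=0), female height
mu1, sigma1 = 173, 6.5    # P(x | y=1), male height
prior0 = prior1 = 1/2

def classify(x):
    # likelihood-ratio test: ratio > 1 => y_hat = 0, ratio < 1 => y_hat = 1
    ratio = (norm.pdf(x, mu0, sigma0)*prior0) / (norm.pdf(x, mu1, sigma1)*prior1)
    return 0 if ratio > 1 else 1

print(classify(158), classify(180))    # 0 1
```

With equal variances and equal priors, the ratio crosses 1 exactly at the midpoint $(\mu_0+\mu_1)/2 = 166.5$, which is the decision boundary derived in Section 1.1.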

1.1. Equal variance and equal prior

$$ \sigma_0 = \sigma_1 \qquad \text{and} \qquad P(y=0)=P(y=1)=\frac{1}{2}$$

$$P(x) = P(x \mid y=0) P(y=0) + P(x \mid y=1)P(y=1) = \frac{1}{2}\left\{ P(x \mid y=0) + P(x \mid y=1)\right\}$$
  • Decision Boundary
$$P(y=0 \mid x)=P(y=1 \mid x)$$



In [1]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy.stats import norm
from matplotlib import animation
%matplotlib inline
In [4]:
x = np.arange(130,200)
print(x)
[130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183
 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199]
In [8]:
x = np.arange(130,200)

mu0 = 160
mu1 = 173
sigma0 = 6.5
sigma1 = 6.5

L1 = norm.pdf(x,mu0,sigma0)
L2 = norm.pdf(x,mu1,sigma1)

prior0 = 1/2
prior1 = 1/2

Px = L1*prior0 + L2*prior1
posterior0 = (L1*prior0)/Px
posterior1 = (L2*prior1)/Px

var1 = posterior1 - posterior1**2    # var(y|x) = p(1-p), Bernoulli variance

plt.figure(figsize=(10,9))
plt.subplot(3,1,1)
plt.plot(x,L1,'r',x,L2,'b',x,Px,'k')
plt.axis([130, 200, 0, 0.1])
plt.text(mu0-10,0.08,'P(x|y=0)', color='r', fontsize=15)
plt.text(mu1,0.08,'P(x|y=1)', color='b', fontsize=15)
plt.text(185,0.03,'P(x)', fontsize=15)

plt.subplot(3,1,2)
plt.plot(x,L1*prior0,'r',x,L2*prior1,'b')
plt.axis([130, 200, 0, 0.1])
plt.text(mu0-10,0.05,'P(x|y=0)P(y=0)', color='r', fontsize=15)
plt.text(mu1,0.05,'P(x|y=1)P(y=1)', color='b', fontsize=15)

plt.subplot(3,1,3)
plt.plot(x,posterior0,'r',x,posterior1,'b',x,var1,'k')
plt.axis([130, 200, 0, 1.1])
plt.text(140,0.8,'P(y=0|x)', color='r', fontsize=15)
plt.text(180,0.8,'P(y=1|x)', color='b', fontsize=15)
plt.text(170,0.4,'var(y|x)', fontsize=15)
plt.show()

1.2. Equal variance and unequal prior

$$ \sigma_0 = \sigma_1 \qquad \text{and} \qquad P(y=1) > P(y=0) $$



In [10]:
x = np.arange(130,200)

mu0 = 160
mu1 = 173
sigma0 = 6.5
sigma1 = 6.5

L1 = norm.pdf(x,mu0,sigma0)
L2 = norm.pdf(x,mu1,sigma1)

prior0 = 1/4
prior1 = 3/4

Px = L1*prior0 + L2*prior1
posterior0 = (L1*prior0)/Px
posterior1 = (L2*prior1)/Px

var1 = posterior1 - posterior1**2    # var(y|x) = p(1-p), Bernoulli variance

plt.figure(figsize=(10,9))
plt.subplot(3,1,1)
plt.plot(x,L1,'r',x,L2,'b',x,Px,'k')
plt.axis([130, 200, 0, 0.1])
plt.text(mu0-10,0.08,'P(x|y=0)', color='r', fontsize=15)
plt.text(mu1,0.08,'P(x|y=1)', color='b', fontsize=15)
plt.text(185,0.03,'P(x)', fontsize=15)

plt.subplot(3,1,2)
plt.plot(x,L1*prior0,'r',x,L2*prior1,'b')
plt.axis([130, 200, 0, 0.1])
plt.text(mu0-10,0.05,'P(x|y=0)P(y=0)', color='r', fontsize=15)
plt.text(mu1,0.05,'P(x|y=1)P(y=1)', color='b', fontsize=15)

plt.subplot(3,1,3)
plt.plot(x,posterior0,'r',x,posterior1,'b',x,var1,'k')
plt.axis([130, 200, 0, 1.1])
plt.text(140,0.8,'P(y=0|x)', color='r', fontsize=15)
plt.text(180,0.8,'P(y=1|x)', color='b', fontsize=15)
plt.text(170,0.4,'var(y|x)', fontsize=15)
plt.show()

1.3. Unequal variance and equal prior

$$ \sigma_0 \ne \sigma_1 \qquad \text{and} \qquad P(y=0)=P(y=1)=\frac{1}{2}$$



In [20]:
x = np.arange(130,200)

mu0 = 160
mu1 = 173
sigma0 = 6.5
sigma1 = 15

L1 = norm.pdf(x,mu0,sigma0)
L2 = norm.pdf(x,mu1,sigma1)

prior0 = 1/2
prior1 = 1/2

Px = L1*prior0 + L2*prior1
posterior0 = (L1*prior0)/Px
posterior1 = (L2*prior1)/Px

var1 = posterior1 - posterior1**2    # var(y|x) = p(1-p), Bernoulli variance

plt.figure(figsize=(10,9))
plt.subplot(3,1,1)
plt.plot(x,L1,'r',x,L2,'b',x,Px,'k')
plt.axis([130, 200, 0, 0.045])
plt.text(mu0-20,0.03,'P(x|y=0)', color='r', fontsize=15)
plt.text(mu1,0.03,'P(x|y=1)', color='b', fontsize=15)
plt.text(158,0.027,'P(x)', fontsize=15)

plt.subplot(3,1,2)
plt.plot(x,L1*prior0,'r',x,L2*prior1,'b')
plt.axis([130, 200, 0, 0.045])
plt.text(mu0-28,0.015,'P(x|y=0)P(y=0)', color='r', fontsize=15)
plt.text(mu1+5,0.015,'P(x|y=1)P(y=1)', color='b', fontsize=15)

plt.subplot(3,1,3)
plt.plot(x,posterior0,'r',x,posterior1,'b',x,var1,'k')
plt.axis([130, 200, 0, 1.1])
plt.text(143,0.9,'P(y=0|x)', color='r', fontsize=15)
plt.text(180,0.8,'P(y=1|x)', color='b', fontsize=15)
plt.text(173,0.35,'var(y|x)', fontsize=15)
plt.show()

Examples of Gaussian decision regions

  • The separating surfaces are piecewise level sets of quadratic functions.

  • When the covariances are all equal, the separating surfaces are hyperplanes.
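Both statements can be checked numerically in 1D, where "hyperplane" reduces to a single boundary point. The log-posterior difference between two Gaussian classes with equal variances is linear in $x$, while unequal variances leave a quadratic term; a small sketch reusing this section's illustrative parameters:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(130, 200, 71)

def discriminant(sigma0, sigma1):
    # g(x) = log P(x|y=1) - log P(x|y=0); equal priors cancel out
    return norm.logpdf(x, 173, sigma1) - norm.logpdf(x, 160, sigma0)

g_equal   = discriminant(6.5, 6.5)   # equal variances
g_unequal = discriminant(6.5, 15)    # unequal variances

# the second difference of a linear function is zero; of a quadratic, a nonzero constant
print(np.allclose(np.diff(g_equal, 2), 0))     # True  -> linear discriminant
print(np.allclose(np.diff(g_unequal, 2), 0))   # False -> quadratic discriminant
```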

Lecture 23 in Learning Theory (Reza Shadmehr, Johns Hopkins University)

In [2]:
%%html
<center><iframe src="https://www.youtube.com/embed/3r5SlvjJptM?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [3]:
%%html
<center><iframe src="https://www.youtube.com/embed/X1WB6IJqMjM?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

2. Bayesian Inference

  • Start with prior beliefs, which can be thought of as a summary of opinions.

    • These may be subjective.
  • Given our prior, we can update our opinion with data and produce a new one.

    • This new distribution is called the posterior.
  • Iterate

    • as more data become available.

  • Coin-flip example

  • See how two different priors can result in two different posteriors

  • Gaussian distributions
    • Explaining away

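The coin-flip example can be sketched with a Beta-Binomial model (the toss counts and prior parameters below are assumptions for illustration). Conjugacy makes the posterior another Beta distribution, so the effect of two different priors on the posterior is easy to see:

```python
from scipy.stats import beta

heads, tails = 7, 3        # assumed data: 7 heads in 10 tosses

# two different Beta priors over the head probability
a1, b1 = 1, 1              # flat prior: no initial opinion
a2, b2 = 20, 20            # strong prior opinion that the coin is fair

# conjugate update: Beta(a, b) prior + Binomial data -> Beta(a + heads, b + tails)
post1 = beta(a1 + heads, b1 + tails)
post2 = beta(a2 + heads, b2 + tails)

print(post1.mean())   # ~0.667: follows the data
print(post2.mean())   # 0.54: pulled back toward 0.5 by the strong prior
```

The same data thus yield two different posteriors, depending entirely on the prior.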
In [11]:
%%html
<center><iframe src="https://player.vimeo.com/video/72116712" 
width="640" height="360" frameborder="0" allowfullscreen></iframe></center>

3. Bayesian Density Estimation

In [5]:
%%html
<center><iframe src="https://www.youtube.com/embed/J5uQcuQ_fJ0?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
  • Estimate a probability density function of a hidden state from multiple observations


  • $H$: Hypothesis, hidden state
  • $D = \{d_1,d_2,\cdots,d_m\}$: data, observation, evidence
$$ P(H,D) = P(H \mid D)P(D) = P(D \mid H)P(H) $$
  • Goal: $$ P(H \mid D) = \frac{P(D \mid H)P(H)}{P(D)}: \quad \text{ Bayes' Rule} $$







In [6]:
%%html
<center><iframe src="https://www.youtube.com/embed/w1u6-_2jQJo?rel=0" 
width="420" height="315" frameborder="0" allowfullscreen></iframe></center>

3.1. Combining Multiple Evidences

  • Assume conditional independence (Markovian Property)
$$ \begin{align*} P(H \mid \underbrace{d_1,d_2,\cdots,d_m}_{\text{multiple evidences}}) &= \frac{P(d_1,d_2,\cdots,d_m \mid H) \; P(H)}{P(d_1,d_2,\cdots,d_m)} \\ \\ &=\frac{P(d_1 \mid H)P(d_2 \mid H)\cdots P(d_m \mid H) \;P(H)}{P(d_1,d_2,\cdots,d_m)} \\ \\ &= \eta\prod\limits_{i=1}^m P(d_i \mid H)P(H), \qquad \eta: \text{normalizing} \end{align*}$$
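A small sketch of this product-of-likelihoods update on a discrete hypothesis grid; the observation values are assumptions for illustration, and the noise level matches the 1D example later in this section:

```python
import numpy as np
from scipy.stats import norm

n = 100
H_grid = np.arange(n)            # candidate hidden states H
prior = np.ones(n)/n             # uniform prior P(H)
sigma = 20                       # observation noise (as in the 1D example)

data = [45, 52, 38]              # assumed observations d_1, d_2, d_3

# posterior ∝ P(H) * prod_i P(d_i | H); dividing by the sum plays the role of eta
post = prior*np.prod([norm.pdf(d, H_grid, sigma) for d in data], axis=0)
post = post/np.sum(post)

print(H_grid[np.argmax(post)])   # MAP estimate: 45, the sample mean
```

With a uniform prior and Gaussian likelihoods of equal variance, the MAP estimate coincides with the sample mean of the observations.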

3.2. Recursive Bayesian Estimation

  • Two useful identities
$$ \begin{align*}P(a,b) &= P(a \mid b) P(b) \\ P(a,b \mid c) & = P(a \mid b,c) P(b \mid c) \end{align*}$$
  • When multiple observations $d_1, d_2, \cdots$ arrive one at a time
$$ \begin{align*} P(H\mid d_1) &= \frac{P(d_1 \mid H) P(H)}{P(d_1)} = \eta_1 \, P(d_1 \mid H) \underbrace{P(H)}_{\text{prior}} \\ P(H\mid d_1 d_2) &= \frac{P(d_1 d_2\mid H) P(H)}{P(d_1 d_2)} = \frac{P(d_1 \mid H)P(d_2 \mid H) P(H)}{P(d_1 d_2)} = \eta_2 \, P(d_2 \mid H) \underbrace{P(H\mid d_1)}_{\text{acting as a prior}} \\ & \quad \vdots \\ \\ P(H \mid d_1,d_2,\cdots,d_m) &= \eta_m \,P(d_m \mid H) \underbrace{P(H \mid d_1,d_2,\cdots,d_{m-1})}_{\text{acting as a prior}} \end{align*} $$
  • Recursive
$$P_0(H) = P(H) \implies \; P(H \mid d_1)=P_1(H) \implies \; P(H \mid d_1 d_2) = P_2(H) \implies \; \cdots$$
  • Recursive Bayesian Estimation
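One way to sanity-check the recursion is to confirm that feeding observations in one at a time, with each posterior acting as the next prior, gives the same answer as the batch update of Section 3.1. A minimal sketch with assumed observations:

```python
import numpy as np
from scipy.stats import norm

n = 100
H_grid = np.arange(n)            # candidate hidden states H
sigma = 20                       # observation noise
data = [45, 52, 38]              # assumed observations

# batch update: all evidence at once
batch = np.ones(n)/n * np.prod([norm.pdf(d, H_grid, sigma) for d in data], axis=0)
batch = batch/np.sum(batch)

# recursive update: each posterior becomes the next prior
pr = np.ones(n)/n                # P_0(H) = uniform prior
for d in data:
    po = norm.pdf(d, H_grid, sigma)*pr
    pr = po/np.sum(po)           # P_m(H), acting as the next prior

print(np.allclose(pr, batch))    # True: the two routes agree
```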

3.3. Example in 1D



Given model

In [10]:
H = 50         # if location of unknown object = 50
sigma = 20     # assume normal dist w/ sigma = 20

n = 100
x = np.arange(0,100)
likelihood = norm.pdf(x,H,sigma)
likelihood = likelihood/np.sum(likelihood) #normalizing

plt.figure(figsize=(8,6))
plt.plot(x,likelihood), plt.axis('tight'), plt.ylim([0,0.025])
plt.xlabel('D', fontsize=15)
plt.ylabel('P(D|H)', fontsize=15)
plt.title('H = {}'.format(H), fontsize=20)
plt.show()

Likelihood

In [11]:
L = np.zeros([n,n]) # likelihood

for h in range(n):
    L[:,h] = norm.pdf(x,h,sigma)
    L[:,h] = L[:,h]/np.sum(L[:,h])

plt.figure(figsize=(8,6))
plt.imshow(L, extent=[0,L.shape[0],L.shape[1],0]), plt.colorbar()
plt.xlabel('H', fontsize=15), plt.ylabel('D', fontsize=15), plt.title('Pr(D|H)', fontsize=20)
plt.show()
In [12]:
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(projection='3d')
X,Y = np.meshgrid(x,x)
surf = ax.plot_surface(X,Y,L, cmap='viridis', antialiased=False)
ax.view_init(28,-30)
plt.show()                      

Suppose $H = 45$

In [13]:
## run this cell to see the animation

%matplotlib qt

n = 100
x = np.arange(0,n)

sigma = 20
L = np.zeros([n,n]) # likelihood

for h in range(n):    
    L[:,h] = norm.pdf(x,h,sigma)
    L[:,h] = L[:,h]/np.sum(L[:,h])
    

fig = plt.figure(figsize=(8, 6))

ax1 = fig.add_subplot(1, 1, 1)

graph1, = ax1.plot([], [], 'b')
graph2, = ax1.plot([], [], 'r+')

ax1.set_xlim(0, 100)
ax1.set_ylim(-0.01,0.15)

H = 45

pr = np.ones(n)/n # prior = uniform 
dh = []
 
def init():
    graph1.set_data([], [])
    graph2.set_data([], [])
    return graph1, graph2,

def animate(i):
    global pr, po
    
    if i > 50: anim.event_source.stop()
        
    d = int(np.round(np.random.normal(H,sigma)))    # observation
    
    # clamp the observation into the valid grid index range 0 <= d <= n-1
    d = int(np.clip(d, 0, n - 1))
    
    dh.append(d)
    
    graph1.set_data(x,pr)
    graph2.set_data(dh,np.zeros(np.size(dh)))
    
    po = L[d,:]*pr
    po = po/np.sum(po)
    
    pr = po
    
    return graph1, graph2,

anim = animation.FuncAnimation(fig, animate, init_func=init, interval=100, blit=True)
plt.show()
In [14]:
print(np.mean(dh))
print('I = {}'.format(np.argmax(po)))
40.9038461538
I = 39
In [15]:
%matplotlib inline
plt.figure(figsize=(8,6))
plt.hist(dh,21), plt.xlim([0,100])
plt.title('Histogram of D', fontsize=20)
plt.show()

3.4. Example in 2D

In [7]:
%%html
<center><iframe src="https://www.youtube.com/embed/qsLF3KgavJk?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>
In [8]:
%%html
<center><iframe src="https://www.youtube.com/embed/dhsK-mRzxFw?rel=0" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

iSystems Demo

In [10]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')